Exploring Bias in Baltimore 911 Calls Dataset

2019 IDIES Machine Learning Visualization Hackathon

Project by Darius Irani and Ronan Perry

In [1]:
import os
import pandas as pd
import geopandas as gpd
import pickle
import altair as alt
import json
import shapely.wkt
from datetime import datetime
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

Project workflow consisted of:

  1. Combine shapefiles with different administrative partitions into a single dataframe for Baltimore City
  2. Geocode the coordinate-based incident locations from 911 calls with the geographic labels from that dataframe
  3. Exploratory data analysis of 911 calls over the different geographic partitions
  4. Visualize statistics in an interactive choropleth map
  5. Extract time series relationships from the data and plot temporal (monthly) behavior

Future directions (did not get to):

  1. Pair the time series line graph with the interactivity of the choropleth plots
  2. Fit a time series model to the data
  3. Perform a similar analysis on the BPDArrests dataset and compare results
  4. Identify which call events led to outcomes, to determine bias in the data

...and more!

Building a geodataframe of Baltimore City containing all administrative geographic partitions

  • Combined shapefiles of the State of Maryland (congressional districts, legislative districts, zipcodes, county borders) and Baltimore City (neighborhoods, census tracts)
  • Extracted an outline of Baltimore City from the city shapefile and spatially overlaid it onto all Maryland shapefiles
  • Merged all geodataframes into one with
    • key = name of the feature
    • category = type of geographic feature
    • geometry = shapely polygon objects
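The merge described above can be sketched roughly as follows. This is a hypothetical reconstruction of what process_geography.py does, not its actual contents; the function and parameter names are assumptions, and it takes in-memory GeoDataFrames rather than file paths for clarity:

```python
import geopandas as gpd
import pandas as pd

def build_city_geodataframe(city, partitions):
    """Clip each Maryland-wide partition layer to the Baltimore City outline
    and stack the results into one (key, category, geometry) dataframe.

    city: GeoDataFrame of Baltimore City borders.
    partitions: dict mapping a category name -> (GeoDataFrame, name_column).
    """
    # dissolve the city borders into a single outline polygon
    outline = gpd.GeoDataFrame(geometry=[city.unary_union], crs=city.crs)
    frames = []
    for category, (layer, name_col) in partitions.items():
        # keep only the parts of each partition that fall inside the city
        clipped = gpd.overlay(layer.to_crs(city.crs), outline, how='intersection')
        frames.append(gpd.GeoDataFrame({
            'key': clipped[name_col].astype(str),
            'category': category,
            'geometry': clipped.geometry,
        }, crs=city.crs))
    return pd.concat(frames, ignore_index=True)
```

Each partition contributes rows tagged with its category, which is what makes the single combined dataframe queryable by geography type later on.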

To run this step, in a terminal:

$ python process_geography.py filename pickle_name save_as_csv

The resulting dataframe is:

In [2]:
gdf = pickle.load(open('data/pickles/baltimore.pkl', 'rb'))
gdf.head()
Out[2]:
key category geometry
0 44B legislative_district (POLYGON ((-76.69935408318635 39.2726930176988...
1 21216 zipcode POLYGON ((-76.68187617433654 39.32673857186737...
2 21217 zipcode POLYGON ((-76.65254749813856 39.33080040407072...
3 21213 zipcode POLYGON ((-76.59452471396514 39.30428925921159...
4 21207 zipcode POLYGON ((-76.71119660109203 39.30293795702148...

Geocode coordinate-based incident locations from 911 calls

  • There were some formatting errors in the raw dataset
    • Location coordinates had to be parsed out of the free-text location field
    • Rows without a location coordinate had to be dropped
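The coordinate parsing might look like the sketch below. The regex and the helper name `parse_location` are illustrative assumptions, not the notebook's actual cleaning code; it targets the `(lat, lon)` pair embedded at the end of the `location` field shown in the raw data:

```python
import re

# matches e.g. "(39.349792, -76.613468)" at the end of the location string
COORD_RE = re.compile(r'\(([-\d.]+),\s*([-\d.]+)\)')

def parse_location(loc):
    """Pull the (lat, lon) pair out of the free-text location field,
    or return None when no coordinate is present."""
    if not isinstance(loc, str):
        return None  # NaN / missing values
    m = COORD_RE.search(loc)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))
```

Rows where this returns None are the ones that get dropped, e.g. `df = df[df['location'].map(parse_location).notna()]`.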

Raw data:

In [3]:
df = pd.read_csv('data/sociology/911_Police_Calls_for_Service.csv')
df.head()
Out[3]:
recordId callDateTime priority district description callNumber incidentLocation location
0 1423624 05/04/2016 09:58:00 PM High ND SILENT ALARM P161253035 400 WINSTON AV 400 WINSTON AV\nBALTIMORE, MD\n(39.349792, -76...
1 1402097 04/27/2016 03:57:00 PM Medium SW 911/HANGUP P161182081 1400 BRADDISH AV 1400 BRADDISH AV\nBALTIMORE, MD\n(39.303941, -...
2 1420176 05/03/2016 06:40:00 PM Medium ED DISORDERLY P161242705 200 E NORTH AV 200 E NORTH AV\nBALTIMORE, MD\n(39.311294, -76...
3 1423653 05/04/2016 10:10:00 PM Medium NE 911/NO VOICE P161253068 2500-1 HARFORD RD 2500 1 HARFORD RD\nBALTIMORE, MD\n(39.316763, ...
4 1417949 05/03/2016 12:29:00 AM Non-Emergency SD Private Tow P161240063 100 W PATAPSCO AV 100 W PATAPSCO AV\nBALTIMORE, MD\n(39.239215, ...
  • Spatial joins were used to determine which polygons from the Baltimore City geodataframe each call's coordinates fell within
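The spatial join can be sketched with geopandas' `sjoin`. This is an illustrative reconstruction, not the contents of geocode_calls.py; the `coords` column of (lat, lon) tuples and the function name are assumptions (older geopandas versions spell the `predicate` keyword `op`):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

def geocode_calls(calls, regions):
    """Label each call with the region polygon its coordinate falls within."""
    points = gpd.GeoDataFrame(
        calls,
        # coords are stored (lat, lon); shapely Points take (x, y) = (lon, lat)
        geometry=[Point(lon, lat) for lat, lon in calls['coords']],
        crs=regions.crs)
    # left join keeps calls whose point falls in no polygon (labels are NaN)
    return gpd.sjoin(points, regions, how='left', predicate='within')
```

Running this once per category (zipcode, neighborhood, census tract, ...) and collecting the `key` columns yields the geocoded table shown below.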

To run this step, in a terminal:

$ python geocode_calls.py dataset pickle_name directory pickle_dir

Geocoded data:

In [4]:
df = pd.read_csv('data/sociology/911_geocoded.csv')
df.head()
Out[4]:
recordId callDateTime priority district description callNumber incidentLocation location census_tract congressional_district county legislative_district neighborhood zipcode
0 1423624 05/04/2016 09:58:00 PM High ND SILENT ALARM P161253035 400 WINSTON AV (39.349792, -76.613468) Census Tract 2711.01 3.0 Baltimore City 43 Radnor-Winston 21212.0
1 1402097 04/27/2016 03:57:00 PM Medium SW 911/HANGUP P161182081 1400 BRADDISH AV (39.303941, -76.66084) Census Tract 1607 7.0 Baltimore City 40 Winchester 21216.0
2 1420176 05/03/2016 06:40:00 PM Medium ED DISORDERLY P161242705 200 E NORTH AV (39.311294, -76.612806) Census Tract 1204 7.0 Baltimore City 43 Greenmount West 21202.0
3 1423653 05/04/2016 10:10:00 PM Medium NE 911/NO VOICE P161253068 2500-1 HARFORD RD (39.316763, -76.595269) Census Tract 805 7.0 Baltimore City 45 Darley Park 21218.0
4 1417949 05/03/2016 12:29:00 AM Non-Emergency SD Private Tow P161240063 100 W PATAPSCO AV (39.239215, -76.61151) Census Tract 2504.01 2.0 Baltimore City 46 Brooklyn 21225.0

Exploring the concentration of 911 calls by neighborhood

  • Counted the total number of calls over the time period
  • Standardized counts by the population of each neighborhood
  • Normalized the per-capita rates to z-scores
In [5]:
def normByPopulation(pops, dframe, regionCol):
    """Per-capita call counts by region, normalized to z-scores."""
    standardizedCrimes = {}
    for _, row in pops.iterrows():
        name, pop = row['name'], row['population']
        if pop == 0:
            continue  # skip regions with no recorded population
        standardizedCrimes[name] = len(dframe[dframe[regionCol] == name]) / pop
    mu = np.mean(list(standardizedCrimes.values()))
    std = np.std(list(standardizedCrimes.values()))
    normalizedCrimes = {i: (j - mu) / std for i, j in standardizedCrimes.items()}

    return pd.DataFrame(list(normalizedCrimes.items()),
                        columns=['neighborhood', 'normalized_crime'])
In [6]:
populations = pickle.load(open('data/pickles/population.pkl','rb'))
neighCounts = {}
regionCol = 'neighborhood'
for region in set(df[regionCol]):
    neighCounts[region] = len(df[df[regionCol]==region])
    
normalizedCrimes = normByPopulation(populations,df,'neighborhood')
normalizedCrimes.head()
Out[6]:
neighborhood normalized_crime
0 Abell -0.152084
1 Allendale -0.073369
2 Arcadia -0.276360
3 Boyd-Booth -0.107683
4 Arlington 0.324294

Interactive visualizations using Altair (built on the Vega-Lite JavaScript library)

In [7]:
def load_pickle(pickle_dir, pickle_name):
    return pickle.load(open(os.path.join(pickle_dir, pickle_name), 'rb'))

def merge_data(attribute_column, geography, choropleth, pickle_dir):
    gdf = load_pickle(pickle_dir, geography)
    choropleth = load_pickle(pickle_dir, choropleth)
    choropleth.columns = ['key', attribute_column]
    return gdf.merge(choropleth, on='key', how='left')

def prepare_for_altair(attribute_column, geography, choropleth, pickle_dir='data/pickles'):
    df = merge_data(attribute_column, geography, choropleth, pickle_dir)
    gdf = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry='geometry')
    gdf = gdf[['key', 'category', attribute_column, 'geometry']]
    json_features = json.loads(gdf.to_json())
    return alt.Data(values=json_features['features'])

def plot_map_altair(attribute_column, geography, choropleth, pickle_dir='data/pickles'):
    data_geo = prepare_for_altair(attribute_column, geography, choropleth, pickle_dir)
    multi = alt.selection_multi()  # shift-click to select several regions
    return alt.Chart(data_geo).mark_geoshape(
        fill='lightgray',
        stroke='white'
    ).properties(
        projection={'type': 'mercator'},
        width=550,
        height=600,
        selection=multi
    ).encode(
        color=alt.condition(multi, 'properties.' + attribute_column + ':Q',
                            alt.value('lightgray')),
        tooltip=['properties.key:O', 'properties.' + attribute_column + ':Q']
    )

Choropleth of neighborhood population in Baltimore

  • Click on a neighborhood to select it; hold shift to select several
  • To clear the selection, click outside the grayed-out map
In [8]:
plot_map_altair('population','baltimore.pkl', 'population.pkl')
Out[8]:

Choropleth of normalized 911 call concentration

Issues:

  • Missing neighborhoods are ones with a population of 0 (the map above can be used to confirm this)
    • Correcting this would require the transient population of those areas
  • The apparent concentration is skewed by neighborhoods whose (abnormally) low populations inflate per-capita call volume
In [9]:
plot_map_altair('calls','baltimore.pkl', 'normalized_crimes.pkl')
Out[9]:
  • Right now the interactivity doesn't do much for us. With more time, we would have extended the visualization so that selecting neighborhoods triggers additional statistics, such as the time series lines from the plot below

Extract time series relationships from data and plot temporal (monthly) behavior

In [10]:
"""
Constructs time series vector of binned counts from depCol for the indepValue in the indepCol
"""
def binTimes(dframe,indepCol,indepValue,depCol,timeRange):    
    minTime,maxTime = timeRange
    ts = np.zeros(maxTime - minTime + 1)
    
    cnt = Counter(dframe[dframe[indepCol] == indepValue][depCol])
    for key,value in cnt.items():
        ts[key-minTime] = value
    return ts

def addCol(colName='week', stringFormat="%m/%d/%Y %I:%M:%S %p", dframe=None):
    """Add a day/week/month index column to the calls dataframe."""
    if dframe is None:
        dframe = df  # default to the global calls dataframe loaded above

    def time2week(dtime, minYear=2015):
        dt = datetime.strptime(dtime, stringFormat)
        return dt.isocalendar()[1] + (dt.isocalendar()[0] - minYear) * 52

    def time2day(dtime, minYear=2015):
        # approximate day index: ISO weekday + 7 * ISO week + 365 * year offset
        dt = datetime.strptime(dtime, stringFormat)
        return dt.isocalendar()[2] + dt.isocalendar()[1] * 7 + (dt.isocalendar()[0] - minYear) * 365

    def time2month(dtime, minYear=2015):
        dt = datetime.strptime(dtime, stringFormat)
        return dt.month + (dt.year - minYear) * 12

    if colName == 'week':
        func = time2week
    elif colName == 'day':
        func = time2day
    elif colName == 'month':
        func = time2month
    else:
        raise ValueError("colName must be 'day', 'week', or 'month'")

    dframe[colName] = dframe.apply(lambda x: func(x.callDateTime), axis=1)
    timeRange = (min(dframe[colName]), max(dframe[colName]))

    return dframe, timeRange

"""
Constructs time series vector of binned counts from depCol for the indepValue in the indepCol
"""
def binDateTime(dframe,indepCol,indepValue,timeCol):    
    minTime = min(
        dframe[timeCol],
        key=lambda x: datetime.strptime(x, '%m/%d/%Y')
    )
    cnt = Counter(dframe[dframe[indepCol] == indepValue][depCol])
    for key,value in cnt.items():
        ts[key-minTime] = value
    return ts

def addDateResolution(dframe,colName='daily'):
    def convert(dtime,stringTo):
        dt = datetime.strptime(dtime,"%m/%d/%Y %I:%M:%S %p")
        return(dt.strftime(stringTo))
    
    if colName == 'daily':
        stringTo = "%m/%d/%Y"
        func=convert
    
    dframe[colName] = dframe.apply(lambda x: func(x.callDateTime,stringTo),axis=1);

    return(dframe)

def showTimeSeries(tsDict, xlabel='Time bin', ylabel='Number of calls'):
    plt.figure(figsize=(20, 10))
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel(ylabel, fontsize=15)
    for key, val in tsDict.items():
        plt.plot(val, label=key)
    plt.legend()

Time series plot:

  • There are currently far too many series to extract useful information, but pairing this with the interactive choropleth maps above would let us decrease the amount of data shown at once
In [11]:
colName = 'month'
df,timeRange = addCol(colName)

tsDict = {}
regionCol = 'neighborhood'
normalize=False

for region in set(df[regionCol]):
    if normalize:
        pop = populations[populations['name'] == region]['population']
        if pop.empty:
            continue
        pop = pop.iloc[0]
        if pop != 0:
            tsDict[region] = binTimes(df,regionCol,region,colName,timeRange) / pop
    else:
        tsDict[region] = binTimes(df,regionCol,region,colName,timeRange)

tsMonthDict = tsDict.copy()
showTimeSeries(tsDict)
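One lightweight way to reduce the clutter noted above, short of full choropleth-linked interactivity, is to plot only the highest-volume neighborhoods. This helper is not part of the original notebook; the name `topNSeries` is our own:

```python
import numpy as np

def topNSeries(tsDict, n=5):
    """Keep only the n time series with the largest total counts."""
    return dict(sorted(tsDict.items(), key=lambda kv: -np.sum(kv[1]))[:n])

# e.g. showTimeSeries(topNSeries(tsMonthDict, n=5))
```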